tests: diagnostic on release build#2177
Merged
Merged
Conversation
35f7f95 to
237e4a5
Compare
1550b0e to
9a8f8e9
Compare
utkarash2991
approved these changes
May 30, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Diagnostic-only PR for the amd64 stage-1 AMI build failure in
ami-release-nix.yml.The failing shape is:
amazon-ebssurrogateexits withScript exited with non-zero exit status: 123This PR is intentionally temporary. The goal is to capture enough evidence to identify the root cause, then remove or reduce the diagnostics.
Current working theory
The latest evidence points away from a deterministic command failure in
90-cleanup.sh.The strongest current theory is that the amd64 source build instance, SSH provisioner session, or bootstrap process is being killed or interrupted outside normal Bash control near the late Ansible / cleanup boundary. One plausible cause is memory pressure triggering OOM behavior; this is still a hypothesis, not proven.
Why this changed:
EXITtrap added to the parent bootstrap did not fire in the latest amd64 failuresfailed=0What has been tried
Round 1: instrument
90-cleanup.shAdded tracing and in-chroot logging to
scripts/90-cleanup.sh:Wrapped the parent
chroot /mnt /tmp/90-cleanup.shcall inebssurrogate/scripts/surrogate-bootstrap-nix.shto capturecleanup_rcand tail/mnt/tmp/90-cleanup.logbefore re-exiting.Result from workflow run
26598216064, job78374796167:exec,tee,ERRtrap setup, and the/tmppermission checkchmod 1777 /tmpareaERRtrap did not printInitial suspicion was
chmod 1777 /tmp, but later runs weakened that.Round 2: pre-chroot
chmodisolation testAdded a plain pre-cleanup chroot test that ran
/bin/chmod 1777 /tmpviachroot /mnt /bin/bash -c ..., without the90-cleanup.shtee/exec wrapper.Result from workflow run
26633420910, job78487682887:bash -cargumentset +edid not allow the parent shell to continue to thepre_rcechoThis invalidated the clean theory that
chmod 1777 /tmpitself was the root cause.Round 3: chroot micro-tests
Added five micro-tests before cleanup to isolate:
/mntrootfs binary presencechroot /bin/echobash -cbash -credirected to a host-side fileThese were later removed because the run shape had already shifted and the diagnostics were adding noise without surviving the failure reliably.
Round 4: parent
EXITtrapAdded a parent-shell
EXITtrap insurrogate-bootstrap-nix.shintended to fire no matter where Bash exited. It captured:Also removed the round-2/round-3 micro-tests.
Latest run checked:
26652852736.Results:
785555859487855558599078555585994EXITtrap did not print in the amd64 failuresfailed=0update_systemd_services90-cleanup.sh90-cleanup.sh/teeThis strongly suggests the failure is outside ordinary Bash error handling.
Ruled out so far
chmod 1777 /tmpfailure.#2162as the sole trigger; an amd64 build after that commit had succeeded.failed=0.Current local next diagnostic
The next local diagnostic change is scoped to
ARCH=amd64only, so it applies across PG15, PG17, and orioledb-17 amd64 builds without changing arm64 behavior.It adds:
/mnt/tmp/build-swapfilefree -hswapon --showdf -h/df -ifor key mountsvm.panic_on_oom=0kernel.panic=0execute_playbookupdate_systemd_servicesclean_systemEXITtrap output, avoiding rawdmesg, raw Ansible logs, and raw cleanup logs in GitHub outputThe intent is to test the memory-pressure/OOM hypothesis without printing likely-sensitive console/log contents into GitHub Actions.
What to look for next
EXITtrap, look next at SSH/provisioner/session termination or external runner/build-instance behavior.123, OOM behavior is likely confirmed as the failure class.